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Abstract 

We consider a binary unsupervised classification problem where each observation 
is associated with an unobserved label that we want to retrieve. More precisely, we 
assume that there are two groups of observation: normal and abnormal. The 'normal' 
observations are coming from a known distribution whereas the distribution of the 
'abnormal' observations is unknown. Several models have been developed to fit this 
unknown distribution. In this paper, we propose an alternative based on a mixture of 
Gaussian distributions. The inference is done within a variational Bayesian framework 
and our aim is to infer the posterior probability of belonging to the class of interest. 
To this end, it makes no sense to estimate the mixture component number since each 
mixture model provides more or less relevant information to the posterior probability 
estimation. By computing a weighted average (named aggregated estimator) over the 
model collection, Bayesian Model Averaging (BMA) is one way of combining models 
in order to account for information provided by each model. The aim is then the es- 
timation of the weights and the posterior probability for one specific model. In this 
work, we derive optimal approximations of these quantities from the variational theory 
and propose other approximations of the weights. To perform our method, we con- 
sider that the data are dependent (Markovian dependency) and hence we consider a 
Hidden Markov Model. A simulation study is carried out to evaluate the accuracy of 
the estimates in terms of classification. We also present an application to the analysis 
of public health surveillance systems. 

Keywords: Model averaging, Variational Bayes inference, Markov Chain, Unsu- 
pervised classification. 



1 Introduction 



Binary unsupervised classification We consider an unsupervised classification prob- 
lem where each observation is associated with an unobserved label that we want to retrieve. 
Such problems occur in a wide variety of domains, such as climate, epidemiology (see Cai 
et al. |17|), or genomics (see McLachlan et al. [IT]) where we want to distinguish 'normal' 
observations from abnormal ones or, equivalently, to distinguish pure noise from signal. In 
such situations, some prior information about the distribution of 'normal' observations, or 
about the distribution of the noise is often available and we want to take advantage of it. 
More precisely, based on observations X = {X{\, we want to retrieve the unknown binary 
labels S = {St} associated with each of them. We assume that 'normal' observations 
(labelled with 0) have distribution eft, whereas 'abnormal' observations (labelled with 1) 
have distribution /. We further assume that the null distribution <p is known, whereas the 
alternative distribution / is not. In a classification perspective, we want to compute 

T t = Pv{S t = 0\X}. (1) 

Bayesian model averaging (BMA) The probability Tt depends on the unknown 
distribution /. Many models can be considered to fit this distribution and we denote 
M = {/ m ; m = 1, . . . , M} a finite collection of such models. As none of these models is 
likely to be the true one, it seems more natural to gather information provided by each of 
them, rather than to try to select the 'best' one. The Bayesian framework is natural for 
this purpose, as we have to deal with model uncertainty. 

Bayesian model averaging (BMA) has been mainly developed by Hoeting et al. (4j and pro- 
vides the general framework of our work. It has been demonstrated that BMA can improve 
predictive performances and parameter estimation in Madigan and Raftery |8j, Madigan 
et al. (71, Raftery et al. |13[|18| or Raftery and Zheng [l4j. Jaakkola and Jordan [5] also 
demonstrated that model averaging provides a gain in terms of classification and fitting. 
The determination of the weight a m associated with each model m when averaging is a 
key ingredient of all these approaches. 

Weight determination As shown in Hoeting et al. [4j the standard Bayesian reasoning 
leads to a m = Pr{M = where M stands for the model. In a classical context, the 

calculation of a m requires one to integrate the joint conditional distribution P(M,Q\X), 
where G is the vector of model parameters, and several approaches can be used. The BIC 
criterion (Schwarz [IB]) is based on a Laplace approximation of this integral, which is ques- 
tionable for small sample sizes. One other classical method is the MCMC (Monte Carlo 
Markov Chain) [I] which samples the distribution and can provide an accurate estimation 
of the joint conditional, but at the cost of huge (sometimes prohibitive) computational 
time. 

In the unsupervised classification context, the problem is even more difficult as we need to 
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integrate the conditional P(M,Q, S\X) since the labels are unobserved. This distribution 
is generally not tractable but, for a given model, Beal and Ghahramani [2] developed a 
variational Bayes strategy to approximate P(@, S\X). Variational techniques aim at min- 
imising the Kullback-Leibler (KL) divergence between P(&, S\X) and an approximated 
distribution Qe,5 (Wainwright and Jordan (19) , Corduneanu and Bishop (3j). Jaakkola 
and Jordan (51 proved that the variational approximation can be improved by using a 
mixture of distributions rather than factorised distribution as the approximating distribu- 
tion. A mixture distribution Qmix is chosen to minimise the KL-divergence with respect 
to P(Q, S\X). Unfortunately, they need to average the log of Q m ix over all the configura- 
tions which leads to untractable computation and a costly algorithm involving a smoothing 
distribution must be implemented. 



Our contribution In this article, we propose variational-based weights for model aver- 
aging, in presence of a Markov dependency between the unobserved labels. We prove that 
these weights are optimal in terms of KL-divergence from the true conditional distribution 
P{M\X). To this end, we optimise the KL-divergence between P(Q, S, M\X) and an ap- 
proximated distribution Qe,S,M (Section [2]). This optimisation problem differs from that 
of Jaakkola and Jordan (see equation 14 in (5]). Based on the approximated distribution 
of P(9, S\M, X), we derive other estimations of the weights. 

We then go back to the specific case of unsupervised classification and consider a collection 
M of mixtures of parametric exponential family distributions (Section [3]). We propose a 
complete inference procedure that does not require any specific development in terms of 
inference algorithm. In order to assess our approach, we propose a simulation study which 
highlights the gain of model averaging in terms of binary classification (Section]!]). We also 
present an application to the analysis of public health surveillance systems (Section [5]). 



2 Variational weights 

2.1 A two-step optimisation problem 

In a Bayesian Model Averaging context, we focus on averaged estimator to account for 
model uncertainty It implies evaluating the conditional distribution: 

P{M\X) = J P(H,M\X)dH, (2) 

where H stands for all hidden variables, that is H = (S, O), and M denotes the model. 



In order to calculate this distribution, we need to compute the joint posterior distri- 
bution of H and M. Due to the latent structure of the problem this is not feasible but 
the mean field/ variational theory allows one to derive an approximation of this distribu- 
tion. It has mainly been developed by Parisi [12] and provides an alternative approach to 
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MCMC for inference problem within a Bayesian framework. The variational approach is 
based on the minimisation of the KL-divergence between P(H, M\X) and an approximated 
distribution Qh M- The optimisation problem can be decomposed as follows: 



min KL{Q H>M \\P(H,M\X)) 

Qh,m 



mm[KL(Q M \\P(M\X)) 

Qm 



+ ^QM(m) min KL(Q H{m \\P(H\X,m)) 



SH\m 



(3) 



This decomposition separates Qm and Qh\m> an d so these optimisations can be realised in- 
dependently. We are mostly interested in Qm which provides an approximation of P{M\X) 
given in Equation [2] Furthermore, since the collection Ai is finite, we do not need to put 
any restriction on the form of Qm and may deal with the weights a m = Qm(w) for each 
mEM. In the following, we will first minimise the KL-divergence with regard to Q m lead- 
ing to weights that depend on Qn\ m . In a second step, we will consider the approximation 
of P(H\X, m). 

2.2 Weight function of any approximation of P(H\X, m) 

We now consider the optimisation of Qm- Proposition |2.1| provides the optimal weights. 



Proposition 2.1 The weights that minimise KL(Qh,m\\P(H, M\^)) with respect to Qm, 
for given distributions {QH\m-> m £ M.}, are 

a m (Q H \ m ) oc P(m) exp[-KL(Q Hlm \\P(H\X, m)) + log P(X\m)], 

with ^ m a m (Q H \ m ) = 1. 

Proof 2.1 KL(Qh,m\ \P{H, M\X)) can be rewritten as: 

f n ru\n i m [ QH\m(h)Q M (m) 1 
E J Q Hlm (h)Q M (m)lo g [ p{h ^ x)/p{x) \ dh 

= Y QH\m{h)QM{m) [log Q H \ m {h) + log Q M (m) + log P{X) - log P(h, m, X)] dh 

m 

QH\m(h) 



Yl (J QH\ m {h)Qwi{m) 



log 



+ log Qm{™) — log P{m) 



dh +\ogP(X) 



P(h,X\m 

= Yl {QmH [KL(Q Hlm \\P{H,X\m)) + log Q M (m) - log P(m)]) + log P{X) 

m 

The miminisation with respect to Qm subject to s ^2 im Qhi{'m) = 1 gives the result. 



Note that if Qhw 
resumes to P(m\X). 



P(H\X,m) then KL-divergence in the exponential is 0, so a r 
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2.3 Weights based on the optimal approximation of P(H\X, m) 
We now derive three different weights from the variational Bayes approximation. 



Full variational approximation To solve the optimisation problem [3] we still need to 
minimise the divergence 

KL{Qij\ m \\P{H\X,m)) for each model m, where H = (S, 0). 

Due to the latent structure, the optimisation cannot be done directly. When P(X, S\Q, M) 
belongs to the exponential family and if P(@\M) is the conjugate prior, the Variational 
Bayes EM (VBEM: Beal and Ghahramani (2l) algorithm allows us to minimise this KL- 
divergence within the class of factorised distributions: Q m = {Qh\t>i '■ Qn\m = Qs\mQe\m}- 
Due to the restriction, the optimal distribution 

Qnfm = arg min KL(Q H \ m \\P(H\X, m)) 

is only an approximation of P(H\X,m). This allows us to define the optimal variational 
weights. 

Corollary 2.1 The weights achieving the optimisation problem^for factorised con- 
ditional distribution Qmm are: 

a^ B oc P(m)exp[- min KL(Q H \ m \\P(H\X,m)) + \agP(X\m)\. 

Plug-in weights The weights a m = Pr{M = m|X} can be estimated by using a plug-in 
estimation based on a direct application of Bayes' theorem. The conditional probabil- 
ity P(m\X) is proportional to P(X\m) that equals to P(X\m, @)P(@\m)/P(&\X, m) for 
any value of 0, which avoids integrating over 5. The distribution resulting from 

the VBEM algorithm is an approximation of P(Q\X,m). Setting at its (approximate) 
posterior mean 6* = Eqvb(0), we define the following plug-in estimate 

a»,P W Wgp) . (4) 



Importance sampling The weights given in Corollary 2.1 are based on an approxima- 
tion of the conditional distribution P{H\X). But, the weights defined in[2]can be estimated 
via importance sampling (Marin and Robert 19]). For any distribution R, we have 



r,, f r,, s P(X\h,m)P(h\m) 

P(m\X) oc / P(m) v 1 ' ' v 1 J R(h)dh. 
J R{h) 



Importance sampling provides an unbiased estimator of P{m\X). The importance function 
R can be chosen to minimise the variance of the estimator. The minimal variance is reached 
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when R(H) equals P{H\X) [9j. Thus, in the variational framework, the approximated 
posterior distribution Q^ m is a natural choice for the importance function R, leading to 
the following weights: 

Although this estimate is unbiased, when the number of observations is large, it may re- 
quire a long computational time to get a reasonably small variance. 



3 Unsupervised classification 
3.1 Binary hidden Markov model 

We now come back to the original binary classification problem with Markov dependence 
between the labels. To this aim we consider a classical hidden Markov model (HMM). We 
assume that {St}i<t<n is a first order Markov chain with transition matrix II = {vr^; i,j = 
0, 1}. The observed data {X t }i<t< n are independent conditionally to the labels. We denote 
(ft the emission distribution in state ('normal') and / the emission distribution in state 
1 ('abnormal'). We recall that the function (ft is known whereas / is unknown and we 
consider the collection Ai = {fm', m = 1, • ■ ■ , M} where f m is a mixture of m components: 

m m 

fm{x) = ^Pkfikix), with y^pfc = 1- 

k=l k=l 

This collection is large as it allows us to fit the data from a two-component mixture (see 
McLachlan et al. [IT]) to a semi-parametric kernel-based density (see Robin et al. |15|). 
When / is approximated by a mixture of m components, the initial binary HMM with 
latent variable S can be rephrased as an (m + l)-state HMM with hidden Markov chain 
{Zf} taking its values in {0, ... , m} with transition matrix 

/ 7T o TTQlPl . . . TTQlPm \ 
TTio TTllPl . . . UllPm 

\ TTlO 7THP! . . . TTnp m J 

The observed data {X t }i<t< n are independent conditionally to the {Z t } with distribution 

X t \Z t ~ (ftz t i 

where (fto = (ft. Hence, we have two latent variables Z and S which correspond to the group 
within the whole mixture and to the binary classification, respectively. 
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3.2 Variational Bayes inference 

The VBEM (Beal and Ghahramani [2]) aims at minimising the KL-divergence in expo- 
nential family/conjugate prior context. The quality of the VBEM estimators has been 



studied in Wang and Titterington ( (221, 23 , [2l]) for mixture models. Wang and Tit 



terington [20] have also studied the quality of variational approximation for state space 
models. The VBEM algorithm has been studied by McGrory and Titterington [lO] for the 
HMM with emission distributions belonging to the exponential family. In these articles, 
the authors have demonstrated the convergence of the variational Bayes estimator to the 
maximum likelihood estimator, at rate 0{l/n). They also show that the covariance matrix 
of the variational Bayes estimators is underestimated compared to the one obtained for the 
maximum likelihood estimators. 

In our case, P(X, S\@, M) does not belong to the exponential family whereas P(X, Z\Q, M) 
does. We will therefore make the inference on the (m + l)-state hidden Markov model in- 
volving Z rather than the binary hidden Markov model involving S. Despite the spe- 
cific form of the transition matrix f2, it does not modify the framework of the expo- 
nential family/conjugate prior. To be specific, logP(X, Z\Q, M) can be decomposed as 
logP(Z|G,M) + log P(X\Z,Q,M) and only the first term involves £1: 



log P{Z\Q,M) = ^J^iV fci log7rii+JVbologiroo + ^iVfcolog7rio 

k=l 3=1 k=l 



+ ^iV 0i log vroi + Z X k log qi + Z w log go 
j=i k=i 



mm m 



+^2^2 Nk i log v 3 + ^2 Zik log pk ' ( 5 ) 

k=0 j=l k=l 

with Nkj = ^2t>2 Zt-i,kZtj and q is the stationary distribution of II. Since log P(Z\Q,M) 
can be written as a scalar product §.u{Z) with $ the vector of parameters and u(Z) the 
vector containing the {Nkj}x<k,j<m an< i the sums over Z, it shows that Z\@,M belongs 
to the exponential family and that this specific form of only affects the updating step of 
hyper-parameters. 

3.3 Model averaging 

For each model m from the collection Ai, the VBEM algorithm provides the optimal 
distributions Q^ m , from which we can derive the three weights defined in Section |2j a m B , 
a^ E and a^f . Based on these weights, we can get an averaged estimate of the distribution 
/: 

/ = ^ ] a mfmi 
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where A corresponds to one of the proposed approaches (VB, PE or IS). Although the 
largest model only involves M components, the averaged distribution is a mixture with 
M(M + l)/2 components. As we are mostly interested in the estimation of the posterior 
probability T% defined in[TJ we similarly define its averaged estimate: 

' ■ ^ Z\m 

m 

where Eqvb (St) corresponds to the expected value of S calculated with the optimal 
variational posterior distribution of Z. This expectation does not depend on A. 

4 Simulation study 

In this section, we study the efficiency of the estimators defined in the previous sections. 
First, we study the accuracy of a VB and a PE in terms of weight estimation. Then, we 
focus on the accuracy from a classification point of view. We therefore liken the averaged 
estimator of the posterior probability Ti to the theoretical one. We also compare the 
averaging approach with a classical two-state HMM and with the HMM which has the 
highest weight calculated with the importance sampling approach, called throughout the 
paper "selected HMM" . 

4.1 Simulation design 

We simulate a binary HMM as described in Section |3j where / is non Gaussian and define 
as the probit transformation of a uniform-distribution on [0, with c E [5, 7, 10, 15]. The 
difficulty of the problem decreases with the parameter c. We also consider four different 
transition matrices which have the same form given by: 



n » = (l(l-tt) l-Z(l-u)) (6) 

where I is the shifting rate which varies from to 1 and u corresponds to the propor- 
tion of the group of interest and is chosen within {0.05,0.1,0.2,0.3}. For each of the 16 
configurations we generate P = 100 samples of size n = 100. The inference is done in 
a semi-homogeneous case: for each simulation condition, we fit a 7-component Gaussian 
mixture with common variance a 2 and mean \x\. for the alternative. In a Bayesian context, 
the parameters are random variables with prior distributions. These distributions are cho- 
sen to be consistent with the exponential conjugate family. We denote by A the precision 
parameter, A = ^j, we have: 

• Transition matrix: For j = 1,2, 7Tj. ~ T>(1, 1). 
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• Mixture proportions: p ~ T>(1, . . . , 1). 

• Precision: A ~ r(0.01, 0.01). 

• Means: /Xfc|A ~ TV" (o, 00 | xA ) 
4.2 Results 

We present the results for I = 0.6. We considered other values for this parameter but the 
performances are almost similar. 



4.2.1 Accuracy of the weight 

We consider the importance sampling as a reference for weight estimation as it provides 
an unbiased estimate of the true weights whatever the approximation. We compared it 
to VB and PE weights by calculating the total variation distance, which quantifies the 
dissimilarity between two distributions a 1 and a 2 : 



5( a \a 2 ) = \Y,W\x)-a 2 (x)\. (7) 



2 

X 

The closer to this distance is, the better the estimation of the weights. 





u 


u 


0.05 


0.1 


0.2 


0.3 


c 


PE 


VB 


PE 


VB 


PE 


VB 


PE 


VB 


5 


0.419 


0.069 


0.370 


0.101 


0.453 


0.120 


0.456 


0.069 


7 


0.438 


0.096 


0.403 


0.101 


0.287 


0.101 


0.257 


0.072 


10 


0.386 


0.092 


0.271 


0.180 


0.232 


0.115 


0.107 


0.092 


15 


0.372 


0.093 


0.303 


0.158 


0.258 


0.129 


0.102 


0.101 



Table 1: Total variation distance between the estimated weights with respect to importance 
sampling for each value of u and c. 



Table [T] shows that VB weights are the closest to IS weights. The total variation distance 
5(a VB , a IS ) is close to whatever the simulation study. In contrast, the PE weights seem 
not to be correct for approximating the true weights except when the two populations are 
well separated. These trends are also brought out when we focus on the weights calculated 
for the P samples given a simulation condition. On average, compared to the PE approach, 
the VB method tends to provide weight estimations close to those of the IS approach. For 
instance, for c = 7 and u = 0.2, they mix three models with a huge weight (~ 0.70) for fi 
and weights around 0.15 for fa and fy. However, the VB method has more stable estimated 
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weights than IS. PE is the more stable approach among the three but it tends to only select 
the two-component model with an average weight around 0.95. 

Conclusion on the weight estimation By directly analysing the weight estimation, 
the similarities between the IS and the VB methods have clearly appeared. The VB method 
provides a good estimation of the true weights which is not the case for PE. Hence, when 
the computational time of the IS method is becoming very high, we get a real advantage 
by using the VB method in terms of weight estimation. 

4.2.2 Accuracy of the posterior probabilities 

Once the weights have been estimated, the averaged estimates of the posterior probabilities 
T t are computed for each approach. The aim of the VB method is to cluster the data 
into two populations. In many cases, these populations are difficult to distinguish but 
some observations are easily classifiable without any statistical approach. Hence, we put 
aside observations with a theoretical probability of belonging to the cluster of interest 
smaller than 0.2 or higher than 0.8. A classical indicator to measure the quality of a given 
classification is the MSE (Mean Square Error) which evaluates the difference between the 
averaged estimate T A of one method of A and the theoretical values . 

P n 

M ^= l -^ l -Y.^-T^)\ (8) 

p=l t=l 

The MSE- 4 - estimation allows us to evaluate the quality of the estimates provided by Model 
m over all datasets p = {1, P} and one approach of A. The smaller the MSE, the better 
the performances are. 

Since we deal with synthetic data, we can look at the best achievable MSE. This aims 
at minimising the MSE within the averaged estimator family to obtain an oracle weight 
We denote this oracle by a* and we have: 

M 

a* = argmin 1 1 - a m f ^ 1 1 2 , (9) 
a 1 

with Em=i a m = 1 and Vm G {1, M}, < a m < 1. The variable f ( m ) is the estimation 
of T supplied by model m. This oracle can be viewed as the weights we would choose if the 
theoretical posterior probability of belonging to the group of interest were known. This 
oracle estimator is obtained by a functional regression under non-negativity constraint and 
it can be written as: 

a* = (f'f)- l f'T^ x 7 (10) 
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u = 0.05 


c 


P£ 


VB 


IS 


SelectedHMM 


Oracle 


5 


0.44(0.04) 


0.36 (0.03) 


0.38(0.04) 


0.42(0.03) 


0.31(0.02) 


( 


0.54(0.04) 


0.42 (0.04) 


0.43(0.04) 


0.47(0.03) 


0.34(0.02) 


1 n 
1U 


0.35(0.04) 


0.30 (0.04) 


0.30 (0.04) 


0.34(0.04) 


0.21(0.03) 


10 


0.38(0.04) 


0.34(0.04) 


0.33 (0.04) 


0.36(0.03) 


0.23(0.03) 




« = 0.1 


c 


PE 


VB 


IS 


SelectedHMM 


Oracle 


5 


0.40(0.04) 


0.37 (0.03) 


0.39(0.03) 


0.39(0.03) 


0.29(0.03) 




0.29(0.03) 


0.23 (0.03) 


0.23 (0.03) 


0.25(0.03) 


0.17(0.02) 


1 n 
1U 


0.28(0.03) 


0.28(0.03) 


0.23 (0.03) 


0.28(0.03) 


0.16(0.02) 


10 


0.25(0.04) 


0.22(0.03) 


0.20 (0.03) 


0.22(0.03) 


0.17(0.02) 




u = 0.2 


c 


PE 


VB 


IS 


SelectedHMM 


Oracle 


5 


0.33(0.03) 


0.29 (0.03) 


0.30(0.03) 


0.31(0.03) 


0.19(0.02) 


7 


0.26(0.03) 


0.23 (0.02) 


0.24(0.02) 


0.25(0.02) 


0.18(0.02) 


10 


0.23(0.03) 


0.20(0.02) 


0.19 (0.02) 


0.23(0.01) 


0.17(0.02) 


15 


0.08(0.01) 


0.09(0.01) 


0.07 (0.01) 


0.09(0.01) 


0.06(0.02) 




u = 0.3 


c 


PE 


VB 


IS 


SelectedHMM 


Oracle 


5 


0.23(0.02) 


0.19 (0.01) 


0.20(0.01) 


0.22(0.01) 


0.16(0.01) 


7 


0.13(0.01) 


0.11 (0.01) 


0.12(0.01) 


0.13(0.01) 


0.09(0.01) 


10 


0.17(0.02) 


0.12(0.01) 


0.11 (0.01) 


0.18(0.01) 


0.03(0.01) 


15 


0.12(0.01) 


0.10(0.01) 


0.09 (0.01) 


0.12(0.01) 


0.06(0.01) 



Table 2: Mean(sd) of the misclassification rate for the three averaging approaches. 



where 7 is a normalising constant and T is the matrix containing the estimates 
for all model m. Several algorithms allow one to calculate this estimator numerically by 
taking constraints into account. In this article, the optimisation has been achieved by the 
Newton-Raphson algorithm. 

Figure [T] displays the MSE calculated for the different methods under the various sim- 
ulation conditions. First, we notice that the VB method based on the optimal variational 
weights provides good results in most of the cases. Moreover, we observe that an averaging 
approach with either the IS or VB method provides better results than the selected HMM. 
We observe that the PE method and the two-state-HMM provide the worse estimates for 
many simulation conditions than do the VB and IS methods. Another comment is that 
there is no method which is the best whatever the simulation condition. Moreover, the 
estimations get closer to the oracle estimator when the problem is becoming easier. 
Figure [2] shows the standard deviation of the MSE over all the simulation conditions. We 
notice that the VB method has one of the lowest variabilities. Once more, the two-state 
HMM has the worst performances. 

Table [2] includes information on the misclassification for the three averaging approaches. 
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4 6 3 10 12 14 

Uniform distribution parameter 



4 6 3 10 12 14 

Uniform distribution parameter 




Figure 1: Mean square error (MSE) between the true posterior probabilities and the esti- 
mates as a function of the uniform parameters. Methods: "A": PE, "V": two-state-HMM, 
"*": IS, Selected HMM, "0": VB, "O" : Oracle. Top left: n . 05 , Top right: n .i, 
Bottom left: II0.2, Bottom right: II0.3. VB and Oracle are in dotted lines. 
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4 6 8 10 12 14 4 B 8 ID 12 14 

Uniform distribution parameter Uniform distribution parameter 




4 6 a 10 12 14 4 B 10 12 14 

Uniform distribution parameter Uniform distribution parameter 



Figure 2: Standard deviation of the MSE. Methods: "A": PE, "V": two-state-HMM, "*": 
IS, Selected HMM, "0": VB, "O" : Oracle. Top left: n . 05 , Top right: n .i, Bottom 
left: TIo.2) Bottom right: II0.3. VB and Oracle are in dotted lines. 
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The misclassification rate is calculated on the P samples whatever the simulation condition. 
The values in bold correspond to the smallest misclassification rate among the PE, VB 
and IS approaches. First, we note that the VB and the IS methods have very similar 
misclassification rates whatever the simulation condition. Moreover, this rate corresponds 
to the best rate of the three averaging methods. The averaged estimator supplied by the 
plug-in weights estimation seems to misclassify more data than the other approaches. Once 
again, Table [2] shows us that the VB approach provides good results when the simulation 
condition is complicated. In fact, when c equals either 5 or 7, the averaging method 
based on optimal variational weights provides the lowest misclassified rate among the three 
averaging approaches. Since the misclassification rate of the oracle is close to the rates 
obtained by VB and IS estimation, the two approaches provide good results for each value 
of c and u. An other comment is that the selected HMM approach always prodives worse 
results than the IS and VB ones. This means that the averaging approach brings a gain to 
the posterior probability estimation. 

Figure [3] shows the entropy of the weights. We note that the optimal variational weights 
have one of the largest entropies among all the proposed weights. This means that the VB 
method tends to mix several models. Contrary to the other three weights, PE has a low 
entropy. This method seems to select only one model to infer posterior probability and 
does not take others into account. 



Conclusion on the accuracy of the estimates Studying the MSE indicator allows 
us to compare the methods in terms of classification. Except for the "two-state-HMM" 
approach, we highlight that all the proposed methods have quite similar behaviours. How- 
ever, the VB method provides better results in terms of MSE and its standard deviation 
than does the PE approach. These results are very close to those of IS and even often bet- 
ter. The focus on the misclassification rate confirmed the closeness between our approach 
and that of IS. These methods have a quite similar misclassification rates whatever the 
simulation condition. Furthermore, this rate corresponds to the best rate among the three 
averaging approaches. The computational time is also a key point of these classification 
methods. Indeed, the VB method has a negligeable computational time compared with IS. 
This may further dramatically increase with the size of the data. 



5 Real data analysis 
5.1 Description 

The data In this section, we focus on the analysis of a real dataset collected from public 
health surveillance systems. These data have also been studied in the recent paper of Cai 



et al. 17 using an FDR (False Discovery Rate) approach. The database is composed of 



1216 time points. The data and log-transformation of them are shown in figure |4j The 
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Figure 3: Entropy of the weights. Methods: "A": PE, "V": two-state-HMM, "*": IS 
Selected HMM, "0": VB, "O" : Oracle. Top left: n . 05 , Top right: n .i, Bottom left 
II0.2, Bottom right: II0.3. VB and Oracle are in dotted lines. 
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m 


mean 


variance 


proportions 


a VB 


1 


4.9 


1.1 


1 


< 10" 4 


2 


( 4.5, 5 ) 


0.9 


( 0.67, 0.33 ) 


< 10" 4 


3 


( 4, 4.2, 6 ) 


0.3 


( 0.32, 0.32, 0.34 ) 


0.34 


4 


( 3.9, 4.1, 5, 6.3 ) 


0.2 


( 0.22, 0.27, 0.26, 0.25 ) 


0.66 


5 


( 3.8, 4, 4.1,5.2, 6.4 ) 


0.18 


( 0.17, 0.19, 0.22, 0.22, 0.20 ) 


< 10~ 4 


6 


( 3.8, 4, 4.1, 4.8, 5.6, 6.5 ) 


0.15 


( 0.14, 0.16, 0.16, 0.20, 0.16, 0.18 ) 


< 10" 4 



Table 3: Parameter estimation of the Gaussian mixture within the alternative distribution 



event described by the data can be classified into 2 groups: usual or unusual. These two 
groups correspond to a regular low rate and an irregular high rate respectively. Hence, the 
first group represents our group of interest and the other one the alternative. Moreover, it 
is clear that an event highly depends on the past and Strat and Carrat (6| demonstrated 
that this kind of data can be described by using a two-state HMM. In this analysis, we 
thus aim at retrieving the two groups in the population and we want to estimate well the 
posterior probability of belonging to the group of interest. 

Initialisation of the algorithm To avoid any influence of the prior distributions, they 



have been chosen as described in Section 4.1 As considered in the simulation section, the 



alternative distribution has been fitted by a Gaussian mixture with common variance. The 
number of components m within the alternative distribution varies from 1 to 6 and the 
fixed distribution M(2.37, 0.76 2 ) has been chosen according to results of Cai et al 
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5.2 Results 

For each number of component we infer the model parameters and estimate the weights 
with the VB method. The results we obtained are summarized in Tabled 

Every model presented in Table [3] has the same estimation of the transition matrix 

In their article, Cai et al. selected a model with two heterogeneous 



0.04 0.96 

Gaussian distributions for the alternative. In our approach, due to the homogeneous as- 
sumption, the number of components increases and we keep two models with three and 
four components respectively. The other models have a low weight, smaller than 10 -4 , and 
have no influence on the posterior probability estimation. We now focus on the classifica- 
tion provided by the averaged distribution and the 3-component model proposed by Cai 
et al. [l~7| and we notice that only 3 points differ between our approach from that of Cai. 
However if we focus on these three points, we observe that they correspond to points with 
a posterior probability close to 0.5. These points are on the borderline between the two 
classes. As our approach tends to increase the posterior probabilities (see figure [5]), the 
epidemical ranges are greater with our approach. In two cases, the epidemics are declared 
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earlier with the VB method than with that of Cai. 

Figure [5] displays the averaged posterior probabilities against the estimations obtained 
by the model proposed by Cai. The first comment is that the two approaches provide 
close estimations. This is especially the case for probabilities smaller than 0.3 or greater 
than 0.7. These ranges correspond to low entropy areas. The main comment is that an 
averaging approach tends to refine posterior probabilities between 0.3 and 0.7. This high 
entropy area is considered as a difficult area for estimating the probabilities. In fact, it 
mainly corresponds to data points which are on the borderline between the two classes. 

6 Conclusion 

We proposed a method for binary classification problems based on averaged estimators 
within a variational Bayesian framework. This approach allows us to avoid model selection 
and take model uncertainty into account. It can theoretically be proved that using an 
averaged estimator provides a gain in terms of MSE and increases the lower bound of the 
log-likelihood. We proposed a method based on optimal variational weights which derive 
from a modification of the classical lower bound of the log-likelihood. Our method does 
not required more computational time than classical one. For studying the performances, 
the method has been used on both synthetic and real data. 

The results we obtained on synthetic data showed that our method enhances the es- 
timator in terms of MSE in many simulation conditions. We also highlighted that the 
averaging approach improves the posterior probability estimation provided by the classical 
selection approach. Moreover, we showed that optimal variational weights are closer to 
importance sampling than the plug-in one. Since the importance sampling coped with 
computational time problems for high dimensional datasets, our method is of significant 
interest in this case. 

A real data analysis has been carried out on a clinical dataset. In this context, the ag- 
gregation model still refines the estimation of posterior probabilities. We note in particular 
that the classification is different in cases where the probability is close to 0.5, i.e. when 
the classification is difficult. It allows us to refine the start of the epidemic period. 
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Figure 4: top: weekly ILI rate, middle: log-transformed weekly ILI rate, bottom: Aggre- 
gated posterior probability of ILI epidemic over weeks. The three red points correspond 
to the points which have a different classification from one method to another. 
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Figure 5: Aggregated posterior probabilities according to the estimation of the posterior 
probabilities with the 2 heterogeneous components model. The three red points correspond 
to the points which have a different classification from one method to another. 
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